Current OpenSM Routing
7/18/06

OpenSM offers two routing engines:

1.  Min Hop Algorithm - based on the minimum hops to each node where the 
path length is optimized.

2.  UPDN Unicast routing algorithm - also based on the minimum hops to each 
node, but it is constrained to ranking rules. This algorithm should be chosen 
if the subnet is not a pure Fat Tree, and deadlock may occur due to a 
loop in the subnet.

OpenSM now also offers a file method which can load routes from a table. See 
modular-routing.txt for more information on this.

The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation
   How many hops are required to get from each port to each LID ?
   The algorithm to fill these tables is different if you run standard
(min hop) or Up/Down.
   For standard routing, a "relaxation" algorithm is used to propagate
min hop from every destination LID through neighbor switches
   For Up/Down routing, a BFS from every target is used. The BFS tracks link
direction (up or down) and avoid steps that will perform up after a down
step was used.

2. Once MinHop matrices exist, each switch is visited and for each target LID a
decision is made as to what port should be used to get to that LID.
   This step is common to standard and Up/Down routing. Each port has a
counter counting the number of target LIDs going through it.
   When there are multiple alternative ports with same MinHop to a LID,
the one with less previously assigned ports is selected. 
   If LMC > 0, more checks are added: Within each group of LIDs assigned to 
same target port, 
   a. use only ports which have same MinHop 
   b. first prefer the ones that go to different systemImageGuid (then
the previous LID of the same LMC group)
   c. if none - prefer those which go through another NodeGuid
   d. fall back to the number of paths method (if all go to same node).


Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change in
the fabric switches unless the -r (--reassign_lids) option is specified.

-r
--reassign_lids
          This option causes OpenSM to reassign LIDs to all
          end nodes. Specifying -r on a running subnet
          may disrupt subnet traffic.
          Without -r, OpenSM attempts to preserve existing
          LID assignments resolving multiple use of same LID.

If a link is added or removed, OpenSM does not recalculate
the routes that do not have to change. A route has to change 
if the port is no longer UP or no longer the MinHop. When routing changes 
are performed, the same algorithm for balancing the routes is invoked.

In the case of using the file based routing, any topology changes are
currently ignored The 'file' routing engine just loads the LFTs from the file 
specified, with no reaction to real topology. Obviously, this will not be able 
to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent 
switches will be skipped. Multicast is not affected by 'file' routing engine 
(this uses min hop tables).


Min Hop Algorithm

The Min Hop algorithm is invoked when neither UPDN or the file method are
specified.
 
The Min Hop algorithm is divided into two stages: computation of
min-hop tables on every switch and LFT output port assignment. Link
subscription is also equalized with the ability to override based on
port GUID. The latter is supplied by:

-i <equalize-ignore-guids-file>
-ignore-guids <equalize-ignore-guids-file>
          This option provides the means to define a set of ports
          (by guids) that will be ignored by the link load
          equalization algorithm.

LMC awareness routes based on (remote) system or switch basis.


Purpose of UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops 
of the subnet. A loop-deadlock is a situation in which it is no longer 
possible to send data between any two hosts connected through the loop. As 
such, the UPDN routing algorithm should be used if the subnet is not a pure 
Fat Tree, and one of its loops may experience a deadlock (due, for example, 
to high pressure).

The UPDN algorithm is based on the following main stages:

1.  Auto-detect root nodes - based on the HCA hop length from any switch in 
the subnet, a statistical histogram is built for each switch (hop num vs 
number of occurrences). If the histogram reflects a specific column (higher
than others) for a certain node, then it is marked as a root node. Since 
the algorithm is statistical, it may not find any root nodes. The list of 
the root nodes found by this auto-detect stage is used by the ranking 
process stage.

    Note 1: The user can override the node list manually.
    Note 2: If this stage cannot find any root nodes, and the user did not 
            specify a guid list file, OpenSM defaults back to the Min Hop 
            routing algorithm.

2.  Ranking process - All root switch nodes (found in stage 1) are assigned 
a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the 
subnet are ranked incrementally. This ranking aids in the process of enforcing 
rules that ensure loop-free paths.

3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from 
each (HCA or switch) node in the subnet. During the BFS process, the FDB table 
of each switch node traversed by BFS is updated, in reference to the starting 
node, based on the ranking rules and guid values.

At the end of the process, the updated FDB tables ensure loop-free paths 
through the subnet.


UPDN Algorithm Usage

Activation through OpenSM

Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
Use `-a <guid_list_file>' for adding an UPDN guid file that contains the 
root nodes for ranking.
If the `-a' option is not used, OpenSM uses its auto-detect root nodes 
algorithm.

Notes on the guid list file:
1.   A valid guid file specifies one guid in each line. Lines with an invalid 
format will be discarded.
2.   The user should specify the root switch guids. However, it is also 
possible to specify HCA guids; OpenSM will use the guid of the switch (if 
it exists) that connects the HCA to the subnet as a root node.


To learn more about deadlock-free routing, see the article 
"Deadlock Free Message Routing in Multiprocessor Interconnection Networks" 
by William J Dally and Charles L Seitz (1985).

